Abstract: Calibration of multi-camera networks has been studied
widely since the early days of photogrammetry, and many calibration
algorithms have been presented, each with its relative advantages
and disadvantages. In a stereovision system, multiple view
reconstruction is a challenging task, yet the complete
computational procedure has not been presented in detail before.
In this work we deal with the following problem: when a world
coordinate point is fixed in space, the image coordinates of that
3D point vary with camera position and orientation. From a
computer vision standpoint this is undesirable; the system should
be designed so that the image coordinate of the world coordinate
point is fixed irrespective of the position and orientation of the
cameras. First, the parameters of each camera are calculated in
its own local coordinate system. Then we use global coordinate
data to transfer the local coordinate data of all stereo cameras
into the same global coordinate system, so that everything is
registered in that system. After all the transformations, the
computed image coordinate of the world coordinate point has the
same value for all camera positions and orientations. That is, the
whole system is calibrated.

I. Introduction 

Several camera calibration methods have been presented by
various authors, each with its relative advantages and
disadvantages. However, all these algorithms concentrate on
extracting camera parameters. In the model proposed by
Abdel-Aziz & Karara[1], parameters are extracted in an elegant
fashion, though the lens distortion of the cameras is neglected.
Later, the model proposed by Tsai[2] formed the basis of
accurate camera calibration. But Tsai's algorithm requires a
non-linear search and depends on the precision of the data. The
homography-based method first proposed by Zhang[3] is a modern
method that makes use of advanced projective geometry, but it
too needs considerable computational effort at the initial
stage. Besides these traditional methods, a variety of other
algorithms have been presented in the literature.

Manuscript Received June 30, 2010 for review. 

Ayan Chaudhury & Abhishek Gupta are with the department of Computer 
Science & Engineering, University of Calcutta, 92 A.P.C. Road Kolkata- 
700009, India. (Phone: +91 9830783652 and +91 9143008407 ; email: 
ayanchaudhury.cs@gmail.com and abhishek_guptall8@yahoo.com) 

Sumita Manna and Subhadeep Mukherjee are with the A.K.Choudhury 
School of Information Technology, University of Calcutta, 92 A.P.C. Road 
Kolkata-700009, India. (Phone: +91 9230272856 and +91 8013526896 ; 
email: sumimcal5@gmail.com and soumol2@gmail.com) 

Amlan Chakrabarti is with the A.K.Choudhury School of Information 
Technology, University of Calcutta, 92 A.P.C. Road Kolkata, India (phone: 
091-33-23500289/+91 9831129520; fax: 091-33-23519755 ; email: 
acakcs@caluniv.ac.in) 



In this paper, however, our aim is not to present a new
calibration method. We concentrate on the problem of multiple
view reconstruction in a multi-camera network, making extensive
use of the traditional models described above for the
calibration of the cameras.

Let the same static scene be imaged by two cameras C and C',
as shown in Figure 1.

X = (X, Y, Z) = (X', Y', Z')




Fig. 1 Imaging of a static scene by two cameras 

These could be two physically separate cameras or a single
moving camera at different positions. Let the scene
coordinates of a point X be (X,Y,Z) in the C coordinate system
and (X',Y',Z') in the C' coordinate system. We denote the
corresponding image coordinates of X in the image planes P and P'
by u = (x,y) and u' = (x',y'). The points u and u' are said to be
corresponding points.

Hence, the same world coordinate scene is mapped to
different image coordinates. Our aim is to relate corresponding
points mathematically by an explicit one-to-one mapping. That
is, after calibration of the cameras, every camera will image
the same world coordinate point to the same pixel. This should
hold for an arbitrary number of cameras in the network.

For simplicity, we restrict ourselves to a single calibration
object in the static scene and verify our result on a single
calibration point.

II. Proposed Model

Figure 2 shows the proposed calibration method. In our
method, we have concentrated on our ultimate goal of 'unifying'
the different camera views. To keep matters simple we have
neglected lens distortion, though it can be incorporated using
the technique proposed by Shih[4].

In the first phase of our work, we generated a calibration
pattern with 19 calibration points, whose coordinates are
measured in both the local and the global coordinate system.



Build the calibration pattern, which acts as the object in the
3D world coordinate system. Measure the calibration dots in
the local coordinate system of the calibration box and in a
global coordinate system.

Create the stereo camera arrangement according to the room
configuration and the requirements of the system.

Take an image from each camera, from the leftmost to the
rightmost.

Calibrate each camera. Compute the transformation matrix of
each camera in its own local coordinate system using the
Singular Value Decomposition (SVD) method.

Obtain the rotation and translation matrices in the global
coordinate system. This is done over all calibration points
on the calibration object, in a least squares fashion.

Using the computed rotation and translation matrices,
transform each camera matrix into the global coordinate system,
so that everything is registered in the same global coordinate
system.

Perform feature point extraction on the calibrated images
using standard object detection methods.

Generate the computed image points using each camera's
transformed matrices. This is the calibrated and unified
image for all the cameras, where the object position is fixed.



Fig. 2 A scheme of the proposed calibration method 

Out of these dots, eight are used in extracting the camera
parameters and 18 are used for computing the global coordinate
data. First, snapshots are taken from each camera position
and the pixel coordinates of six predetermined points are
measured. Then, from the 3D-2D correspondences, the parameters
of every camera are computed in its own local coordinate system.
We then use the global coordinate data to transfer all local
coordinate data into the same global coordinate system, so that
everything is registered there.

We then perform some morphological image processing
operations to extract feature points from each image taken by
the stereo cameras. This step is necessary for further
processing of the images. After all these operations, the
image coordinates as well as the pixel coordinates of the fixed
calibration point in space are computed. Although the image
coordinates of a fixed world coordinate point normally vary with
camera position and orientation, we have calibrated the system
in such a way that the image coordinate of this point does not
vary. That is, the whole system is calibrated.

III. Mathematical Formulation 

We now present the complete mathematical formulation of the
system. The first phase consists of the extraction of the camera
parameters and the next phase is the coordinate system
transformation.

A. Extraction of camera parameters 

To express an arbitrary object point, given in the world
coordinate system, in image coordinates, we first need to
transform it to camera coordinates. This transformation consists
of a rotation and a translation. Let a world coordinate point be
represented as $P_w = (X_w, Y_w, Z_w)$ and the corresponding 3D
camera coordinate as $C = (X, Y, Z)$. Then the transformation
from 3D world to 3D camera coordinates can be represented as:

\[ C = R\,P_w + T \]

Let us denote the following:

\[
\begin{aligned}
X &= R_{11} X_w + R_{12} Y_w + R_{13} Z_w + T_x \\
Y &= R_{21} X_w + R_{22} Y_w + R_{23} Z_w + T_y \\
Z &= R_{31} X_w + R_{32} Y_w + R_{33} Z_w + T_z
\end{aligned}
\tag{1}
\]

If f is the focal length of the camera, then from pinhole
geometry we can write

\[ x = f\,\frac{X}{Z}, \qquad y = f\,\frac{Y}{Z} \tag{2} \]

The intrinsic camera parameters usually include the effective
focal length f, the scale factor $s_u$, and the image center
$(u_0, v_0)$, also called the principal point. Here, as usual in
the computer vision literature, the origin of the image
coordinate system is in the upper left corner of the image array.
The unit of the image coordinates is pixels.

Let $(x_{im}, y_{im})$ be the pixel coordinates and $(O_x, O_y)$
be the optical center. If the scale factors along the x and y
directions are $S_x$ and $S_y$, then the following equations can
be written:

\[ x = (x_{im} - O_x)\,S_x, \qquad y = (y_{im} - O_y)\,S_y \]

Hence, from the last two equations we can write

\[ (x_{im} - O_x)\,S_x = f\,\frac{X}{Z}, \qquad (y_{im} - O_y)\,S_y = f\,\frac{Y}{Z} \tag{3} \]

which implies

\[ x_{im} = a_u\,\frac{X}{Z} + O_x \tag{4} \]

\[ y_{im} = a_v\,\frac{Y}{Z} + O_y \tag{5} \]

where $a_u = f/S_x$ and $a_v = f/S_y$ are the scaling parameters
in the x and y directions respectively.
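
The chain from equation (1) to equations (4) and (5) can be illustrated with a minimal Python sketch. It assumes the distortion-free pinhole model used in this paper; the function name `project_point` and its argument names are hypothetical and chosen only for this illustration.

```python
def project_point(Pw, R, T, a_u, a_v, o_x, o_y):
    """Project a world point to pixel coordinates with a distortion-free pinhole model.

    Pw  : world point (X_w, Y_w, Z_w)
    R   : 3x3 rotation matrix as a list of rows
    T   : translation (T_x, T_y, T_z)
    a_u, a_v : scaling parameters f/S_x and f/S_y
    o_x, o_y : optical center in pixels
    """
    # Equation (1): world coordinates -> camera coordinates.
    X, Y, Z = (sum(R[i][j] * Pw[j] for j in range(3)) + T[i] for i in range(3))
    if Z == 0:
        raise ValueError("point lies in the plane of the optical center (Z = 0)")
    # Equations (4) and (5): camera coordinates -> pixel coordinates.
    x_im = a_u * X / Z + o_x
    y_im = a_v * Y / Z + o_y
    return x_im, y_im
```

For example, with an identity rotation and zero translation, `project_point((1, 2, 4), I, (0, 0, 0), 800, 800, 500, 550)` evaluates the two ratios X/Z = 0.25 and Y/Z = 0.5 before scaling and shifting them by the optical center.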

The pinhole model is only an approximation of the real camera
projection. It is a useful model that enables a simple
mathematical formulation of the relationship between object
and image coordinates. However, it is not valid when high
accuracy is required and therefore a more comprehensive
camera model must be used. Usually, the pinhole model is a
basis that is extended with some corrections for the
systematically distorted image coordinates. To keep things
simple, we neglect lens distortion.

So, in the first stage of the mathematical formulation, we have
described the transformation from world coordinates to pixel
coordinates. Our aim now is to compute the parameters, for which
we use the Direct Linear Transformation (DLT) method[1]. The
DLT method is based on the pinhole camera model and ignores
the nonlinear radial and tangential distortion components.

In this method, a linear transformation maps the world
coordinates $(x_w, y_w, z_w)$ to the pixel coordinates $(x, y)$:

\[
a \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}
= M \begin{pmatrix} x_w \\ y_w \\ z_w \\ 1 \end{pmatrix}
\tag{6}
\]

where a is a scale parameter and M is the calibration matrix
given by

\[
M = \begin{pmatrix}
m_{11} & m_{12} & m_{13} & m_{14} \\
m_{21} & m_{22} & m_{23} & m_{24} \\
m_{31} & m_{32} & m_{33} & m_{34}
\end{pmatrix}
\tag{7}
\]

which consists of the camera internal and external parameters.
Hence, equation (6) can be rewritten using

\[ M = M_{int}\,M_{ext} \tag{8} \]

where $M_{int}$ is the matrix containing the internal parameters
and $M_{ext}$ is the matrix containing the external parameters.

1) Solving Through Least Square Approach: Up to this
point, we have only set up the equation relating the intrinsic
and extrinsic camera parameters. Now we have to solve it to
obtain M, the calibration matrix. Referring to equation (6), the
value of a is

\[ a = m_{31} x_w + m_{32} y_w + m_{33} z_w + m_{34} \]

Substituting this value back into equation (6) and rearranging,
we obtain

\[ m_{11} x_w + m_{12} y_w + m_{13} z_w + m_{14} - x\,m_{31} x_w - x\,m_{32} y_w - x\,m_{33} z_w - x\,m_{34} = 0 \tag{9} \]

\[ m_{21} x_w + m_{22} y_w + m_{23} z_w + m_{24} - y\,m_{31} x_w - y\,m_{32} y_w - y\,m_{33} z_w - y\,m_{34} = 0 \tag{10} \]

So each 3D-2D correspondence gives two equations in the 12
unknowns $(m_{11}, \ldots, m_{34})$. With enough correspondences
the system becomes an overdetermined system of linear equations,
which can be solved by statistical model fitting. Here we have
taken 8 calibration points whose 3D-2D coordinates are known.
However, many more correspondences and equations can be obtained,
and M can be estimated through the least squares technique. Given
N matches, we have the homogeneous linear system

\[ L\,\mathbf{a} = 0 \tag{11} \]

where L is the $2N \times 12$ matrix whose rows hold the
coefficients of equations (9) and (10) for each match, and
$\mathbf{a} = [m_{11}, m_{12}, \ldots, m_{33}, m_{34}]^T$.
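
How the rows of L arise from equations (9) and (10) can be sketched in a few lines of Python. This is only an illustration under the assumption that the unknowns are ordered row by row as m11, ..., m34; the helper names `dlt_rows` and `build_dlt_matrix` are hypothetical and not taken from the original implementation, which was done in MATLAB.

```python
def dlt_rows(world, pixel):
    """Return the two rows of L contributed by one 3D-2D correspondence."""
    xw, yw, zw = world
    x, y = pixel
    # Coefficients of equation (9), unknowns ordered m11 ... m34.
    row_x = [xw, yw, zw, 1.0, 0.0, 0.0, 0.0, 0.0, -x * xw, -x * yw, -x * zw, -x]
    # Coefficients of equation (10).
    row_y = [0.0, 0.0, 0.0, 0.0, xw, yw, zw, 1.0, -y * xw, -y * yw, -y * zw, -y]
    return row_x, row_y


def build_dlt_matrix(world_points, pixel_points):
    """Stack the rows of all N correspondences into the 2N x 12 matrix L."""
    L = []
    for world, pixel in zip(world_points, pixel_points):
        L.extend(dlt_rows(world, pixel))
    return L
```

With noise-free data, multiplying any row of L by the flattened true calibration matrix gives zero; with measured data the products are small residuals, which is what the least squares solution minimizes.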

We can estimate the parameters in a least squares fashion. In
order to avoid the trivial solution $m_{11}, \ldots, m_{34} = 0$,
a proper normalization must be applied. Abdel-Aziz and Karara[1]
used the constraint $m_{34} = 1$. The equation can then be solved
with a pseudoinverse technique. The problem with this
normalization is that a singularity is introduced if the correct
value of $m_{34}$ is close to zero. Faugeras & Toscani[5]
suggested the constraint $m_{31}^2 + m_{32}^2 + m_{33}^2 = 1$,
which is singularity free.

2) Extraction of Parameters through Singular Value
Decomposition (SVD): So far we have set up the equation
containing the DLT matrix and cast it as a least squares
problem. But the main problem remains unsolved: the extraction
of the parameters from M.

In an overdetermined system, the solution can be obtained through
SVD. In this method, a matrix A is factored as

\[ A = U W V^T \]

and the column of V corresponding to the smallest singular value
gives the least squares solution. We apply the same to the matrix
L. That is, by singular value decomposition of the matrix L we
obtain the matrix M. Then we extract the parameters from M.
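
As a brief worked step, the reason the singular vector of the smallest singular value solves the homogeneous system can be written out as follows. This is a sketch in standard notation; the unit-norm constraint on the parameter vector is an assumption made here for illustration, and $\sigma_i$ and $v_i$ denote the singular values of L and the columns of V.

```latex
\[
\min_{\|\mathbf{a}\| = 1} \|L\mathbf{a}\|^2
  = \min_{\|\mathbf{a}\| = 1} \mathbf{a}^T L^T L\,\mathbf{a},
\qquad
L^T L = V W^T W V^T = \sum_{i=1}^{12} \sigma_i^2\, v_i v_i^T .
\]
\[
\mathbf{a} = \sum_{i=1}^{12} c_i v_i,\quad \sum_{i=1}^{12} c_i^2 = 1
\;\Longrightarrow\;
\|L\mathbf{a}\|^2 = \sum_{i=1}^{12} \sigma_i^2 c_i^2 \;\ge\; \sigma_{\min}^2 ,
\]
```

Equality holds when $\mathbf{a}$ is, up to sign, the column of V belonging to $\sigma_{\min}$, so reshaping that column into a 3 x 4 array gives M up to an overall scale.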

It can be seen from (8) that M combines the internal and
external calibration matrices. There are techniques for
extracting some of the physical camera parameters from the DLT
matrix, but not many are able to recover all of them. To
extract the parameters from the DLT matrix, we follow the
approach proposed by Melen[6], who used QR decomposition to
obtain the following:

\[ M = \lambda\,V^{-1} B^{-1} F R T \tag{12} \]

where $\lambda$ is an overall scaling factor and R, T are the
rotation and translation matrices from the object coordinate
system to the camera coordinate system. The matrices V, B and F
contain the focal length f, the principal point $(u_0, v_0)$ and
the coefficients of the linear distortion $(b_1, b_2)$.

Combining the expression for F with $\lambda$ (refer to
equation 5) we can write:

\[
M_{int} = \begin{pmatrix} a_u & 0 & u_0 \\ 0 & a_v & v_0 \\ 0 & 0 & 1 \end{pmatrix}
\tag{13}
\]

Hence, equation (13) gives the matrix of camera internal
parameters as long as we neglect the lens distortion
coefficients. Hence, we can write the following:

\[
M_{ext} = \begin{pmatrix} r_1^T & T_x \\ r_2^T & T_y \\ r_3^T & T_z \end{pmatrix}
\]

where $r_1^T$, $r_2^T$ and $r_3^T$ are the rows of R. Multiplying,
from (8) we write the expression for M as

\[
M = \begin{pmatrix}
a_u r_1^T + u_0 r_3^T & a_u T_x + u_0 T_z \\
a_v r_2^T + v_0 r_3^T & a_v T_y + v_0 T_z \\
r_3^T & T_z
\end{pmatrix}
\tag{14}
\]

After this stage, we impose the constraint
$\mathrm{norm}(m_3) = \|m_3\| = 1$ and $m_{34} = 1$ and write M in
vector form as:

\[
M = \begin{pmatrix} m_1^T & m_{14} \\ m_2^T & m_{24} \\ m_3^T & m_{34} \end{pmatrix}
\tag{15}
\]

where $m_i^T = (m_{i1}, m_{i2}, m_{i3})$.



From the above equation, clearly we have

\[ T_z = m_{34} \quad\text{and}\quad r_3 = m_3 \tag{16} \]

Now we compute the following dot and cross products to get a
solution:

\[ m_1 \cdot m_3 = (a_u r_1 + u_0 r_3)\cdot r_3 = a_u\,r_1\cdot r_3 + u_0\,r_3\cdot r_3 = u_0 \tag{17} \]

Similarly,

\[ m_2 \cdot m_3 = v_0 \tag{18} \]

Now, to compute $a_u$ and $a_v$ we do the following:

\[ m_1 \cdot m_1 = (a_u r_1 + u_0 r_3)\cdot(a_u r_1 + u_0 r_3) = a_u^2 + u_0^2 \]

\[ a_u = \sqrt{m_1^T m_1 - u_0^2}, \qquad a_v = \sqrt{m_2^T m_2 - v_0^2} \]

Again, from equation (14), we can write the following:

\[ a_u r_1 + u_0 r_3 = m_1 \;\Rightarrow\; a_u r_1 = m_1 - u_0 m_3 \]

\[ r_1 = (m_1 - u_0 m_3)/a_u, \qquad r_2 = (m_2 - v_0 m_3)/a_v \]

From $a_u T_x + u_0 T_z = m_{14}$ we obtain

\[ T_x = (m_{14} - u_0 m_{34})/a_u \]

and similarly $T_y = (m_{24} - v_0 m_{34})/a_v$.

The R obtained in this way may not be orthogonal, so we
compute another decomposition $R = U D V^T$ and take

\[ R = U V^T \]
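
The closed-form relations above translate almost line by line into code. The following Python sketch is an illustration only, not the authors' MATLAB implementation: the function name `extract_parameters` is hypothetical, the matrix is assumed to be given up to a positive scale, and the final re-orthogonalization of R through SVD is left out because it needs a linear algebra library.

```python
import math


def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def extract_parameters(M):
    """Recover (a_u, a_v, u0, v0, R, T) from a 3x4 matrix M = M_int * [R | T].

    Assumes M is known up to a positive scale factor (a negative factor
    would flip the signs of r3 and T_z) and that lens distortion is neglected.
    """
    # Normalise so that m3 = (m31, m32, m33) has unit norm.
    scale = math.sqrt(dot(M[2][:3], M[2][:3]))
    M = [[v / scale for v in row] for row in M]
    m1, m2, m3 = M[0][:3], M[1][:3], M[2][:3]
    m14, m24, m34 = M[0][3], M[1][3], M[2][3]

    u0 = dot(m1, m3)                          # equation (17)
    v0 = dot(m2, m3)                          # equation (18)
    a_u = math.sqrt(dot(m1, m1) - u0 ** 2)
    a_v = math.sqrt(dot(m2, m2) - v0 ** 2)
    r1 = [(p - u0 * q) / a_u for p, q in zip(m1, m3)]
    r2 = [(p - v0 * q) / a_v for p, q in zip(m2, m3)]
    r3 = list(m3)                             # equation (16)
    T = [(m14 - u0 * m34) / a_u, (m24 - v0 * m34) / a_v, m34]
    return a_u, a_v, u0, v0, [r1, r2, r3], T
```

With noisy data the rows r1, r2, r3 returned here are only approximately orthonormal, which is why the orthogonalization step described above is still needed afterwards.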



B. Coordinate System Transformation 

So far we have calibrated each camera and extracted its
parameters in its own local coordinate system. Now, to achieve
our goal of calibrating a point in space irrespective of the
camera positions, we follow the approach proposed by
Xiong & Quek[7]. In this approach, after calibrating all
cameras, we transfer all of the local coordinate systems
into a global coordinate system, so that everything is
registered in the same coordinate system. During the
calibration process, besides taking calibration points for
extracting the parameters, we measured the coordinates of 18
points in both the local and the global coordinate system.
The aim is to calculate the rotation and translation
matrices that transform all calibration points to the global
coordinate system. We then apply these matrices to all
camera matrices to register everything in the same global
coordinate system.

Let Y denote the global coordinate system and X the local
one. Then the transformation from X to Y is given by

\[ Y = RX + T \tag{24} \]

where R is the rotation matrix and T is the translation matrix.
Suppose we have n global calibration points whose positions in
the local system are

\[ X = (X_1\ X_2\ \ldots\ X_n)^T \]

where $X_1 = (x_{11}\ x_{12}\ x_{13}), \ldots, X_n = (x_{n1}\ x_{n2}\ x_{n3})$.
Similarly, for the Y coordinates we have

\[ Y = (Y_1\ Y_2\ \ldots\ Y_n)^T \]

where $Y_1 = (y_{11}\ y_{12}\ y_{13}), \ldots, Y_n = (y_{n1}\ y_{n2}\ y_{n3})$.

From the above equations we obtain a linear system in the
unknown entries of the transformation, which can be expressed as:

\[ AZ = b \tag{26} \]

This is an over-determined problem. We use the least squares
approach to solve it, i.e., we minimize

\[ e = \sum_{i}\sum_{k}\Bigl(b_{ik} - \sum_{j} a_{ij} z_{jk}\Bigr)^2 = \|b - AZ\|^2 \tag{27} \]

By minimizing equation (27) we get R, and with equation (24)
and $X_1$, $Y_1$ we obtain T as

\[ T = Y_1 - R X_1 \tag{28} \]

Once we have R and T, we transform the local coordinate system
to the global coordinate system.
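
One possible reading of this least squares step is sketched below in Python. Several choices here are assumptions made for illustration and are not confirmed by the text: the translation is eliminated by subtracting the first calibration point from all others (consistent with equation (28)), the unknown Z is taken to be the transpose of R, and the system is solved through the normal equations. The function names are hypothetical.

```python
def solve_3x3(A, B):
    """Solve A Z = B for a 3x3 Z by Gauss-Jordan elimination with partial pivoting."""
    n = 3
    aug = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("degenerate point set (for example, all points coplanar)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_local_to_global(X, Y):
    """Least squares estimate of R, T such that Y_i is approximately R X_i + T."""
    # Differences to the first point remove T: (Y_i - Y_1) = R (X_i - X_1).
    A = [[xi[k] - X[0][k] for k in range(3)] for xi in X[1:]]
    b = [[yi[k] - Y[0][k] for k in range(3)] for yi in Y[1:]]
    # Normal equations (A^T A) Z = A^T b, with Z = R^T.
    AtA = [[sum(row[i] * row[j] for row in A) for j in range(3)] for i in range(3)]
    Atb = [[sum(ra[i] * rb[j] for ra, rb in zip(A, b)) for j in range(3)] for i in range(3)]
    Z = solve_3x3(AtA, Atb)
    R = [[Z[j][i] for j in range(3)] for i in range(3)]
    # Equation (28): T = Y_1 - R X_1.
    T = [Y[0][i] - sum(R[i][k] * X[0][k] for k in range(3)) for i in range(3)]
    return R, T
```

The R produced this way is an unconstrained 3 x 3 matrix, so with measured data it is only approximately a rotation. The sketch also needs calibration points that are not all coplanar, otherwise the normal equations are singular and the function raises an error.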



IV. Experimental Result 

After the theoretical work, we carried out an experimental
verification. We made a small experimental setup in which a
white sheet of paper with black dots painted on it, glued onto
a board, acted as the object. Four snapshots were taken from
different positions. The images are shown in Fig. 2.
The coordinates of the calibration points were measured in the
local and the global coordinate system. The corresponding 2D
pixel coordinates were also measured. The whole computation was
then simulated in MATLAB and the camera parameters in each
camera's local coordinate system were obtained. We have not
taken lens distortion into consideration, but this has not
affected our result significantly. After back projection through
the matrix M of each camera to 2D pixel coordinates, the mean
error calculated is shown in Table I.

(a) (b) (c) (d)

Fig. 2 Snapshots of the calibration object from four camera positions



TABLE I

Mean Errors for Each Camera Position

Camera position   Mean error in x-direction   Mean error in y-direction
1                 -0.0259                     -0.1195
2                 -0.0861                     -0.0741
3                  0.0213                     -0.0378
4                 -0.8733                     -0.8122



Through the least squares approach we computed the rotation
and translation matrices in the global coordinate system and,
using these, transformed the R and T matrices of each camera
(which are in that camera's own local coordinate system) to the
same global coordinate system.

So, we have registered everything in the same reference frame.
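
One way to make this registration step concrete is the following short derivation. It is a sketch under the assumption that the estimated R of equation (24) is orthogonal, which the least squares solution does not guarantee; the symbols $R_c$, $T_c$, $R_g$ and $T_g$ are introduced here for illustration only. Let $C = R_c X + T_c$ map the local coordinates X of the calibration object to the coordinates C of one camera, and let $Y = RX + T$ be the local-to-global transformation. Eliminating X gives the extrinsic parameters of that camera with respect to the global frame:

```latex
\[
X = R^{T}(Y - T)
\;\Longrightarrow\;
C = R_c R^{T}(Y - T) + T_c = R_g\,Y + T_g,
\qquad
R_g = R_c R^{T}, \quad T_g = T_c - R_c R^{T}\,T .
\]
```

Applying the same substitution to every camera expresses all of them in terms of the single global coordinate Y.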

1) Feature Point Extraction: So far we have performed
mathematical transformations to compute the desired result. Now
we perform one of the basic tasks in a stereovision system:
feature point extraction through object detection. We have
done it through some morphological image processing
operations. The processed images are shown in Fig. 3(a-e).




(d) (e) 

Fig.3 Processed images in performing object detection 



Here we show the processing of one image. In the sequence
of images, the first is the binary version of the colour
image. Next, we inverted it (image (b) in the sequence).
The background is then extracted in the third image, and in the
fourth the background is subtracted from the inverted image. At
this stage we have the detected object with some redundancies
(unexpected object-like features). So we performed another step,
determining pixel connectivity, to detect the object in this
image. The last image in the sequence is the final one, where
the white calibration points are detected. We performed the same
operations on the other images. The final processed images are
shown in Fig. 4(a-d).
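
The pixel connectivity step can be illustrated with a small, self-contained Python sketch. It is not the MATLAB processing used in the experiment: it assumes the image has already been reduced to a binary grid of 0s and 1s, it uses 8-connectivity, and the function name `find_blobs` and the `min_area` threshold for discarding spurious features are hypothetical.

```python
from collections import deque


def find_blobs(binary, min_area=1):
    """Group 8-connected 1-pixels and return (centroid_row, centroid_col, area) per group."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r0 in range(rows):
        for c0 in range(cols):
            if not binary[r0][c0] or seen[r0][c0]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            pixels = []
            while queue:
                r, c = queue.popleft()
                pixels.append((r, c))
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if (0 <= rr < rows and 0 <= cc < cols
                                and binary[rr][cc] and not seen[rr][cc]):
                            seen[rr][cc] = True
                            queue.append((rr, cc))
            area = len(pixels)
            if area >= min_area:
                blobs.append((sum(p[0] for p in pixels) / area,
                              sum(p[1] for p in pixels) / area,
                              area))
    return blobs
```

The centroid of each remaining group can then serve as the measured pixel coordinate of a calibration dot, while groups smaller than `min_area` are treated as redundant object-like features and dropped.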




(a) (b) (c) (d) 

Fig. 4 Sequence of images after object detection, taken from four camera
positions

2) Verification of the results: Finally we are left with
obtaining the calibrated image coordinate, on which the
verification is done, i.e., obtaining the same image coordinate
values for different camera positions.

After transforming every camera's parameters from its local
coordinate system to the same global coordinate system, we
computed the image coordinate from each camera's data and, as
expected, after rounding we obtained the same image coordinate
values for all cameras. The coordinate value obtained in the
computation is (158,88).

Hence, for a single calibration point, we have successfully
calibrated the system. The same simulation can be done for all
points in the image. For simplicity we have considered a single
point.

In the next step we converted the image coordinate
$(X_d, Y_d)$ to the pixel coordinate $(X_f, Y_f)$ using the
following formula from [2]:

\[ X_f = S_x\,{d'_x}^{-1} X_d + C_x \]

\[ Y_f = d_y^{-1}\,Y_d + C_y \]

where $C_x$ and $C_y$ are the row and column numbers of the
centre of the computer frame memory, $d'_x$ is the
centre-to-centre distance between adjacent sensor elements in
the X (scan line) direction, $d_y$ is the centre-to-centre
distance between adjacent CCD sensor elements in the Y
direction, and $S_x$ is the uncertainty factor.

Here we have made some simplifications. We have neglected the
effect of the uncertainty factor $S_x$. According to Tsai[2],
this factor arises from a variety of causes, such as a slight
hardware timing mismatch between the image acquisition hardware
and the camera scanning hardware. Because we are doing the
experiment with a single camera, we can assume that the effect
is the same for all images, yielding the same pixel value.
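
The conversion above amounts to a scale and a shift per axis, as the following Python sketch shows. The function name `image_to_pixel` and its argument names are hypothetical, and the default `s_x=1.0` reflects the simplification of neglecting the uncertainty factor rather than any calibrated value.

```python
def image_to_pixel(x_d, y_d, d_x, d_y, c_x, c_y, s_x=1.0):
    """Convert image-plane coordinates to frame-memory (pixel) coordinates.

    x_d, y_d : image coordinates, in the same length unit as d_x and d_y
    d_x, d_y : centre-to-centre sensor element spacing in X and Y
    c_x, c_y : centre of the computer frame memory, in pixels
    s_x      : uncertainty factor (1.0 means it is neglected)
    """
    if d_x <= 0 or d_y <= 0:
        raise ValueError("sensor element spacing must be positive")
    x_f = s_x * x_d / d_x + c_x
    y_f = y_d / d_y + c_y
    return x_f, y_f
```

Because `s_x` only multiplies the X term, leaving it at 1.0 shifts every image taken with the same camera in the same way, which matches the argument given above for neglecting it.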



The image center in the pinhole camera model is the point in
the image plane at the base of the line that is perpendicular to
the image plane and passes through the focal point. Some
interesting facts about image center approximation can be
found in Tapper[8]. Tsai initially recommended using the
center of the frame buffer as a reasonable estimate. After
further experimentation, he found that the image center in
modern CCD cameras often varies so widely from this
estimate that accurate calibration is impossible.
According to Tapper[8], the image center can be found as the
orthocenter of the triangle formed by the vanishing points of
three orthogonal sets of parallel lines. However, this process
is very cumbersome and is not accurate enough. So, we decided
to take the image center as the midpoint of the pixel array;
that is, for an M x N image, the image center is taken as the
point (M/2, N/2). Here we have taken a (1000 x 1100) image, so
the image center is the point (500,550).

After estimating everything, we mapped the calibration object
to pixel coordinates and reconstructed the other points through
relative pixel shift. The final image is shown in Fig. 5.

Fig. 5 Final Processed Image

V. Conclusion 

Geometric camera calibration is one of the basic tasks in
multi-camera computer vision systems. The character of the
problem determines the requirements of the calibration
method. In this type of system, very high accuracy is needed.
However, achieving high accuracy is one of the most challenging
tasks, because it requires the measurements themselves to be
made very accurately.

In this paper a simple approach to multiple view
reconstruction has been presented. Depending on the needs of
the problem, a more accurate and complex formulation can be
developed in which every object in the scene is calibrated for
the reconstruction, with our approach serving as the basis of
the computation.